ABSTRACT

A brief overview of item response theory is provided, 
and a 186-item bibliography of books and articles on the subject 
dating from 1953 to June 1989 is presented. The overview includes a 
definition of the theory, a discussion of its development and 
application, and comparisons with classical test theory. All 
publications in the bibliography were issued in the United States. 
The bibliography is organized into 13 categories, as follows: general 
articles/texts, models, parameter estimation, model fit, scales,
robustness studies, test development studies, adaptive testing 
studies, item banking studies, equating studies, item bias studies, 
miscellaneous applications, and computer programs. (TJH)

In a few words, item response theory (IRT) postulates that (a)
examinee test performance can be predicted (or explained) by a set of 
factors called traits, latent traits, or abilities, and (b) the 
relationship between examinee item performance and these traits can be 
described by a monotonically increasing function called an item
characteristic function. This function specifies that examinees with higher
scores on the traits have higher expected probabilities for answering 
an item correctly than examinees with lower scores on the traits. In 
applying item response theory to measurement problems, a common 
assumption is made that there is one dominant factor or ability which 
can account for item performance. This so-called "ability" which the 
test measures could be a broadly or narrowly defined aptitude, achieve- 
ment, or personality variable. 

In the one-trait or one-dimensional model, the item characteristic
function is called an item characteristic curve (ICC), and it gives
the probability that examinees at different points on the ability
scale defined for the trait measured by the test answer an item
correctly. Modifications are made in the interpretations of ICCs
when, for example, the underlying trait is an attitudinal variable and
the "item response" is a rating from (say) a Likert scale. In addition
to the assumption of test unidimensionality, it is common to assume
that the item characteristic curves are described by one, two, or three
parameters. The specification of the mathematical form of the ICCs and 
the corresponding number of parameters needed to describe the curves 
determines the particular item response model. Generating and/or 
selecting mathematical forms for ICCs are two of the currently 
important lines of research in the IRT field. 
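
As an illustration of how the number of parameters determines the
model, the sketch below uses the logistic form, which is one commonly
used mathematical form for ICCs; the text above does not commit to any
particular form, so this choice is an assumption. The names icc, theta,
b (difficulty), a (discrimination), and c (lower asymptote), and the
parameter values shown, are illustrative only.

```python
import math

def icc(theta, b, a=1.0, c=0.0):
    """Logistic item characteristic curve (an assumed functional form).

    theta: examinee ability
    b: item difficulty (one-parameter model uses only this)
    a: item discrimination (added in the two-parameter model)
    c: lower asymptote (added in the three-parameter model)
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# The same item described with one, two, and three parameters.
for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    p1 = icc(theta, b=0.5)
    p2 = icc(theta, b=0.5, a=1.7)
    p3 = icc(theta, b=0.5, a=1.7, c=0.2)
    print(f"theta={theta:+.1f}  1P={p1:.3f}  2P={p2:.3f}  3P={p3:.3f}")
```

Each curve increases monotonically in theta, as the theory requires;
with a lower asymptote c, the probability at theta = b is c + (1 - c)/2
rather than one half.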

In any successful application of item response theory, item para- 
meter estimates are obtained to describe the test items, and ability 
estimates are obtained to describe the performance of examinees. Any 
successful application requires that there be evidence that the chosen 
item response model, at least to an adequate degree, fits the test
dataset.

Item response theory (IRT) (or latent trait theory, or item 
characteristic curve theory, as it is sometimes called) has become over 
the last 20 years a very popular topic in the measurement field. There 
have been (1) numerous IRT research studies published in the measure- 
ment journals, (2) a very large number of conference presentations, and 
(3) many successful applications of the theory to pressing measurement 
problems (i.e., test score equating, study of item bias, test develop-
ment, item banking, and adaptive testing).

Interest in item response theory stems from two desirable features 
which are obtained when an item response model fits a test dataset:
Descriptors of test items (the item statistics) are not dependent upon 
the particular sample of examinees chosen from the population of 
examinees for whom the test items are intended, and the expected 
examinee ability scores do not depend upon the particular choice of 
items from the total pool of test items to which the item response
model has been applied. Invariant item and examinee ability para-
meters, as they are called, are of immense value to measurement
specialists. Neither desirable feature is obtained when the well-known
and popular classical test models are used.

There are many well-documented shortcomings of classical testing 
methods and measurement procedures. The first shortcoming is that the 
values of such classical item statistics as item difficulty and item 
discrimination depend on the particular examinee samples in which they 
are obtained. The average level of ability and the variability of 
ability scores in an examinee group influence the values of the item 
statistics, and reliability and validity statistics too, often 
substantially. One undesirable consequence of sample-dependent item
statistics is that these item statistics are only useful when
constructing tests for examinee populations which are very similar to
the sample of examinees in which the item statistics were obtained.
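
The sample dependence of classical item difficulty can be sketched
numerically. The example below assumes a single item with a
two-parameter logistic ICC and two hypothetical examinee groups whose
abilities are normally distributed with different means; the function
names, parameter values, and the grid approximation of the ability
distribution are all assumptions made for illustration.

```python
import math

def icc(theta, a, b):
    """Two-parameter logistic ICC (an assumed functional form)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def expected_p_value(a, b, mean, sd, n_points=2001, width=6.0):
    """Expected proportion correct (classical item difficulty) in a
    group with normal(mean, sd) abilities, approximated on a grid."""
    lo = mean - width * sd
    step = 2.0 * width * sd / (n_points - 1)
    total = 0.0
    weight_sum = 0.0
    for i in range(n_points):
        theta = lo + i * step
        w = math.exp(-0.5 * ((theta - mean) / sd) ** 2)  # normal density, unnormalized
        total += w * icc(theta, a, b)
        weight_sum += w
    return total / weight_sum

a, b = 1.2, 0.0  # the item itself never changes
p_low = expected_p_value(a, b, mean=-1.0, sd=1.0)
p_high = expected_p_value(a, b, mean=1.0, sd=1.0)
print(f"classical difficulty, lower-ability group:  {p_low:.3f}")
print(f"classical difficulty, higher-ability group: {p_high:.3f}")
```

The item parameters a and b are identical in both runs, yet the
classical difficulty value differs between the groups because it
absorbs the ability distribution of the sample.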

A second shortcoming of classical testing methods and procedures 
is that comparisons of examinees on an ability scale measured by a set 
of test items comprising a test are limited to situations where 
examinees are administered the same (or parallel) test items.
Unfortunately, many achievement and aptitude tests are (typically) 
suitable for middle-ability students only and so these tests do not 
provide very precise estimates of ability for either high- or low- 
ability examinees. Increased test score validity without any increase 
in test length can be obtained, in theory, when the test difficulty is 
matched to the approximate ability levels of examinees. But, when 
several forms of a test which vary substantially in difficulty are 
used, the task of comparing examinees becomes more complex because test
scores alone cannot be used.



A third shortcoming of classical testing methods and procedures is 
that they provide no basis for determining what a particular examinee
might do when confronted with a test item. Such information is 
necessary, for example, if a test designer desires to predict test 
score characteristics in one or more populations of examinees or to 
design tests with particular characteristics for certain populations of 
examinees. Also, when an adaptive test is being administered at a 
computer terminal, optimal item selection depends on being able to 
predict how the examinee will perform on various test items. 
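
One way such a prediction can drive adaptive item selection is
sketched below. It assumes a two-parameter logistic ICC and picks the
unused item with the largest Fisher information at the current ability
estimate, which for this model is a^2 P(1 - P). Maximum-information
selection is only one possible rule and is not prescribed by the text
above; the item pool and all names are hypothetical.

```python
import math

def icc(theta, a, b):
    """Two-parameter logistic ICC (an assumed functional form)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of a two-parameter logistic item at theta."""
    p = icc(theta, a, b)
    return a * a * p * (1.0 - p)

def select_next_item(theta_hat, pool, administered):
    """Return the name of the unused item that is most informative at
    the current ability estimate. pool maps name -> (a, b)."""
    candidates = [name for name in pool if name not in administered]
    return max(candidates,
               key=lambda name: item_information(theta_hat, *pool[name]))

# Hypothetical three-item pool: name -> (discrimination, difficulty)
pool = {"easy": (1.0, -1.5), "medium": (1.0, 0.0), "hard": (1.0, 1.5)}
print(select_next_item(-1.4, pool, administered=set()))
print(select_next_item(1.2, pool, administered={"hard"}))
```

When all items have equal discrimination, as in this pool, the rule
reduces to choosing the item whose difficulty is closest to the current
ability estimate.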

Item response theory purports to overcome the shortcomings of 
classical test theory by providing an ability scale on which examinee 
abilities are independent of the particular choice of test items from 
the pool of test items over which the ability scale is defined. 
Ability estimates obtained from different item samples for an examinee 
will be the same except for measurement errors. This feature is 
obtained by incorporating information about the items (i.e., their 
statistics) into the ability estimation process. Also, item parameters 
are defined on the same ability scale. They are, in theory,
independent of the particular choice of examinee samples drawn from the
examinee pool for whom the item pool is intended, although errors in
item parameter estimation will be group dependent. Item parameter
invariance is accomplished by defining the item characteristic curves 
(from which the item parameters are obtained) in a way that the under- 
lying ability distribution is not a factor in item parameter values or 
interpretations. Finally, by deriving standard errors associated with 
individual ability estimates, rather than producing a single estimate 
of error and applying it to all examinees, another of the criticisms of 
the classical test model can be overcome.

In summary, item response theory models provide both invariant
item statistics and ability estimates. These features will be obtained 
when there is a reasonable fit between the chosen model and the
dataset. Through the parameter estimation process, test items and 
examinees are placed on an ability scale in such a way that there is as 
close a relationship as possible between the expected examinee 
probabilities for success on test items obtained from the estimated 
item and ability parameters and the actual performance of examinees 
positioned at each ability level. Item parameter estimates and exam- 
inee ability estimates are revised continually until the maximum agree- 
ment possible is obtained between predictions based on the ability and 
item parameter estimates and the actual test data. 
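
The ability half of this estimation process can be sketched as
follows. The example assumes a two-parameter logistic ICC, treats the
item parameters as already known, and finds the maximum likelihood
ability estimate by Newton-Raphson iteration; a full calibration would
also update the item parameters, which is not shown. The function
names, the step limit, and the item values are assumptions made for
illustration. The returned standard error is specific to the
individual estimate, computed as one over the square root of the test
information at that ability.

```python
import math

def icc(theta, a, b):
    """Two-parameter logistic ICC (an assumed functional form)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def score_and_information(theta, responses, items):
    """First derivative of the log-likelihood and test information."""
    score = 0.0
    info = 0.0
    for u, (a, b) in zip(responses, items):
        p = icc(theta, a, b)
        score += a * (u - p)
        info += a * a * p * (1.0 - p)
    return score, info

def estimate_ability(responses, items, theta=0.0, max_iter=50, tol=1e-8):
    """Maximum likelihood ability estimate with known item parameters.

    responses: list of 0/1 item scores
    items: list of (a, b) pairs, one per response
    Returns (theta_hat, standard_error).
    """
    if len(set(responses)) < 2:
        # All-correct or all-incorrect patterns have no finite estimate.
        raise ValueError("response pattern must mix correct and incorrect")
    for _ in range(max_iter):
        score, info = score_and_information(theta, responses, items)
        step = max(-1.0, min(1.0, score / info))  # limit the step size
        theta += step
        if abs(step) < tol:
            break
    _, info = score_and_information(theta, responses, items)
    return theta, 1.0 / math.sqrt(info)

items = [(1.0, -1.0), (1.0, 0.0), (1.0, 1.0)]  # hypothetical (a, b) values
theta_hat, se = estimate_ability([1, 1, 0], items)
print(f"ability estimate = {theta_hat:.3f}, standard error = {se:.3f}")
```

Each iteration revises the estimate until the predicted number of
correct responses (weighted by discrimination) agrees with the observed
responses, which is the agreement criterion described above applied to
a single examinee.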

Today, item response theory is being used in the United States by 
most of the large test publishers, credentialing organizations, state 
departments of education, large school districts, the Armed Services, 
and industry to (1) construct both norm-referenced and criterion- 
referenced tests, (2) investigate item bias, (3) equate tests, and (4) 
report ability scores and diagnostic information. In fact, the various 
applications have been sufficiently successful that researchers in the 
IRT field have shifted their attention from a consideration of IRT 
model advantages and disadvantages in relation to classical test models 
to consideration of such IRT technical problems as goodness-of-fit
investigations, model selection, parameter estimation, and steps for 
carrying out particular applications. Certainly some issues and 
technical problems remain to be solved in the IRT field, but it would
seem that item response model technology is more than adequate at this 
time to serve a variety of uses.

What follows is an IRT bibliography consisting mainly of
important references (up to June of 1989) which have been published in 
the United States. No attempt was made to catalog the many important 
IRT articles which have appeared in European journals, or other non- 
American journals. The bibliography is organized into 13 categories: 
General Articles/Texts, Models, Parameter Estimation, Model-Fit, 
Scales, Robustness Studies, Test Development Studies, Adaptive Testing 
Studies, Item Banking Studies, Equating Studies, Item Bias Studies, 
Miscellaneous Applications, and Computer Programs. 